Business Analytics of Convenient Food Mart¶

1. Introduction¶

Today, data and its applications are growing at an exponential rate. Merged with business objectives, data can unlock new opportunities and possibilities in many realms. One such area, customer acquisition and retention, benefits greatly from diverse and voluminous data. In this project we analyze and interpret the existing customer base of a chain of convenience stores, inferring store profitability relative to demographic elements, customer preferences, and marketing effectiveness, thereby contributing to the commercial success of each store. We aim to help the management of Convenient Food Mart through the analyses described below, which can help them run their business with more profit and sustainability. The factors we consider focus not only on store attributes but also on customer behaviour while shoppers spend time in the stores. Since sustainability can help businesses reduce cost and improve operational efficiency, we also suggest a few ways in which management can improve the sustainability of the stores.

1.1. What analysis and why?¶

We will analyze the dataset and derive the following insights -

  • What is the proportion of stores that are using recyclable packages? In which area (city/country) is it used most frequently?

            recyclable_package is a categorical column that tells us whether a store uses recyclable packaging for a specific good and
            its corresponding brand. Using the data in this column, we derive the relevant insights. These insights help management
            assess their current level of sustainability and decide how to improve it.

  • Profitability of store based on store features. In other words, what kind of store appears to be most profitable?

             Is inconsistency always bad? We doubt it! The beauty of our dataset is that stores in different states and cities have different
             features. These features describe store properties such as store size and store type. Based on the relevant features, we observe
             and infer which stores are profitable across various regions of the US, and we extend this analysis by location. CFM operates
             on the franchise system and therefore has stores all over the United States, but each store has different properties that determine
             its profits. Using the data at hand, we research the factors that drive store performance in their respective demographic locations.

  • Which is the most effective media type for promotion?

            promotion_name is a categorical feature in our dataset that tells us how various promotion campaigns are run on certain goods of the
            corresponding brand. These campaigns are carried out through a media type (TV, radio, etc.). We research the most popular promotions
            and the most effective media types based on the number of purchases and the profits earned across various regions of the US.

  • What is the customer behavior based on gender and marital status?

            This dataset contains sufficient features to analyze customers and their shopping preferences, for example, which customers (with or
            without kids) tend to buy a store membership. By looking at the relevant features, we derive insights and inferences on customer
            behavior. We also extend this analysis by probing two pairs of relationships: one between food category and store location, and the
            other between food category and marital status. Here, we analyze how consumer food preferences vary by location, which food types
            are popular, and, based on customer marital status, which food categories can be promoted to increase sales.

  • Predicting customer annual income

            Income is a key demographic attribute of a customer. Using the other customer features present in the dataset, we'll try to
            predict a customer's yearly income, which can support grouping customers into income bands and applying targeted promotion
            strategies to each band.

  • Predict cost to acquire the customer based on customer attributes and based on store attributes.

            Using the categorical and numerical variables in the dataset, we'll try to predict the cost of
            acquiring a customer.
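As a preview of how such a model might be set up, here is a minimal sketch using `RandomForestRegressor` (one of the candidates imported later). The dataframe, features, and values below are illustrative stand-ins, not rows from the real dataset:

```python
# Hedged sketch: predicting acquisition cost ('cost') from a few numeric
# attributes. The toy dataframe stands in for the CFM data; feature choice
# here is illustrative only.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import metrics

df = pd.DataFrame({
    'store_sqft':              [23593, 23593, 20319, 20319, 30268, 30268],
    'total_children':          [1, 4, 2, 0, 3, 5],
    'unit_sales(in millions)': [4, 3, 2, 2, 3, 4],
    'cost':                    [126.62, 59.86, 84.16, 95.78, 50.79, 77.10],
})

X = df[['store_sqft', 'total_children', 'unit_sales(in millions)']]
y = df['cost']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    random_state=0)

# fit a random forest and score it on the held-out rows
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)
preds = model.predict(X_test)
print('MAE:', metrics.mean_absolute_error(y_test, preds))
```

On the real dataset the same pattern applies after encoding the categorical columns; the sketch deliberately sticks to numeric features to stay self-contained.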

By transforming this enormous volume of information into quality data, we derive actionable intelligence that enables informed strategies and optimized commercial engagements.

The libraries used for the Python implementation are -

  • pandas (documentation)
  • seaborn (documentation)
  • matplotlib (documentation)
  • numpy (documentation)
  • plotly (documentation)

Note: We will keep adding libraries to the above list as we implement further.

1.2. Our Dataset¶

Our dataset consists of over 60,000 records and 40 features covering customers' details, their shopping preferences, and store details of Convenient Food Mart (CFM). CFM is a chain of convenience stores in the United States (with locations in some states of Mexico and Canada as well). The private company's headquarters are located in Mentor, Ohio, and there are currently approximately 325 stores in the US. Convenient Food Mart operates on the franchise system. The scope of this project covers analysis of the US-based stores only.

Data source: Kaggle

In [1]:
# importing required libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import numpy as np

import plotly.graph_objects as go
import plotly.express as px

import warnings

from sklearn.linear_model import LinearRegression, ElasticNet, Lasso, Ridge
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.preprocessing import OrdinalEncoder
from sklearn import cluster
from sklearn import decomposition
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

import scipy.stats as stats

import re
import random

import statsmodels.api as sm
import scipy.stats as ss

warnings.filterwarnings('ignore')
In [2]:
data = pd.read_csv(r'C:\Users\Chaconne\Documents\学业\UMD\Assignments\704_Python\Group Project\FinalNB\media prediction and its cost.csv')
print(f'Initial shape {data.shape}')
data.head()
Initial shape (60428, 40)
Out[2]:
food_category food_department food_family store_sales(in millions) store_cost(in millions) unit_sales(in millions) promotion_name sales_country marital_status gender ... grocery_sqft frozen_sqft meat_sqft coffee_bar video_store salad_bar prepared_food florist media_type cost
0 Breakfast Foods Frozen Foods Food 7.36 2.7232 4 Bag Stuffers USA M F ... 18670 5415 3610 1 1 1 1 1 Daily Paper, Radio 126.62
1 Breakfast Foods Frozen Foods Food 5.52 2.5944 3 Cash Register Lottery USA M M ... 18670 5415 3610 1 1 1 1 1 Daily Paper, Radio 59.86
2 Breakfast Foods Frozen Foods Food 3.68 1.3616 2 High Roller Savings USA S F ... 18670 5415 3610 1 1 1 1 1 Daily Paper, Radio 84.16
3 Breakfast Foods Frozen Foods Food 3.68 1.1776 2 Cash Register Lottery USA M F ... 18670 5415 3610 1 1 1 1 1 In-Store Coupon 95.78
4 Breakfast Foods Frozen Foods Food 4.08 1.4280 3 Double Down Sale USA M M ... 18670 5415 3610 1 1 1 1 1 Radio 50.79

5 rows × 40 columns

In [3]:
data.columns
Out[3]:
Index(['food_category', 'food_department', 'food_family',
       'store_sales(in millions)', 'store_cost(in millions)',
       'unit_sales(in millions)', 'promotion_name', 'sales_country',
       'marital_status', 'gender', 'total_children', 'education',
       'member_card', 'occupation', 'houseowner', 'avg_cars_at home(approx)',
       'avg. yearly_income', 'num_children_at_home',
       'avg_cars_at home(approx).1', 'brand_name', 'SRP', 'gross_weight',
       'net_weight', 'recyclable_package', 'low_fat', 'units_per_case',
       'store_type', 'store_city', 'store_state', 'store_sqft', 'grocery_sqft',
       'frozen_sqft', 'meat_sqft', 'coffee_bar', 'video_store', 'salad_bar',
       'prepared_food', 'florist', 'media_type', 'cost'],
      dtype='object')
In [4]:
data['sales_country'].value_counts()
Out[4]:
USA       38892
Mexico    17572
Canada     3964
Name: sales_country, dtype: int64

2. Data Cleaning¶

2.1. Selecting US Data¶

A quick look at the data shows that it covers three countries - USA (65%), Mexico (29%), and Canada (6%).
To draw relevant insights from the data, we have decided to work with the US data only. This lets us compare the performance of stores within the US, and focusing on one country also makes analysing customer behaviour easier.
We see that out of about 60,000 records, roughly 40,000 are US-based. We feel this is more than sufficient to produce high-quality analyses and draw meaningful conclusions. So, we go ahead and drop the rows pertaining to the other countries.

In [5]:
# keeping only US specific data
data = data[data['sales_country'] == 'USA']

2.2. Dropping irrelevant columns¶

Since our analysis is primarily based on stores and their customers, and we already have a lot of columns, we can safely remove some columns that intuitively feel irrelevant. Moreover, our final aim is to build algorithms that predict customer acquisition cost, so quite a few columns have no relation to that aspect of our analysis.
After reviewing all the columns, we identify a few that we deem unnecessary for our assessment:

1) There are two columns for the number of cars a customer owns, so we drop the duplicate column.
2) The gross and net weight of a product are product details with no relevance to customer acquisition cost, so they can be dropped. They might have been useful for inventory analysis, but that is beyond the scope of this project. The same applies to units per case, which is also dropped.
3) Whether a product is low fat could be used to study customer behaviour, but only if all products were consumable food items, which is not the case, so we drop this column.
4) Since we have already filtered for US-specific data, the sales country column is redundant and is removed from the dataset.

In [6]:
#deleting redundant columns
del data['avg_cars_at home(approx).1']
del data['gross_weight']
del data['net_weight']
del data['low_fat']
del data['units_per_case']
del data['sales_country']
In [7]:
data.shape
Out[7]:
(38892, 34)

2.3. Re-indexing¶

To make referencing and interpretation easier, we need to properly index the dataset. To this end, we have chosen a multi-level index, first by store state and then by city.

In [8]:
# setting index at state and city level
data = data.set_index(keys=['store_state','store_city']).sort_index(level=[1,0])

3. Data Processing or Data Analysis?¶

Our project mainly classifies as data analysis, since we perform analyses on various columns of our dataset in order to achieve our aim of helping the management of CFM. All six tasks focus on analysis; data cleaning is the only task we implement that involves data processing.

4. Data Analysis¶

4.1. Number of stores with recyclable packages at different locations in United States¶

Although stores in several states derive considerable profit, it is equally important to align retail operations with sustainability objectives.
By analysing the stores in each state by their use of recyclable packaging, such as paper bags or cloth sacks, we intend to identify the stores that lack environmental responsibility and suggest effective strategies to close the gap.

The visualization below shows, for each of the states CA, OR, and WA, how many purchases involved recyclable packaging. Observations and inferences from the graph follow the visualization.

In [9]:
#Number of purchase records with recyclable packaging in each state
number_of_recyclable_packages_per_state = (data["recyclable_package"]).groupby("store_state").value_counts()[:,1]
number_of_recyclable_packages_per_state
Out[9]:
store_state
CA     4918
OR     5927
WA    10817
Name: recyclable_package, dtype: int64
In [10]:
#Number of purchase records without recyclable packaging in each state
number_of_non_recyclable_packages_per_state = (data["recyclable_package"]).groupby("store_state").value_counts()[:,0]
number_of_non_recyclable_packages_per_state
Out[10]:
store_state
CA    3976
OR    4701
WA    8553
Name: recyclable_package, dtype: int64
In [11]:
#Plotting purchases of recyclable-packaged goods in each state
graph = number_of_recyclable_packages_per_state.plot.bar(color=['lightgreen', 'lime', 'green'])
plt.xlabel("State")
plt.ylabel("Number of Records")
plt.title("Purchases with recyclable packaging by state")
plt.xticks(rotation = "horizontal")
plt.show()

Observation: From the graph, it can be observed that stores in Washington invest significantly in recyclable packaging, whereas both Oregon and California trail behind.

Inference: Stores in California and Oregon should further align their business objectives with ecological preservation. For instance, they can invest in low-cost alternatives such as paper bags or encourage customers to bring their own bags to the stores.

With retail being an entirely consumer-facing segment, the number of transactions between company and customer remains high, and with each transaction and physical product sold, a shopping bag has to be provided. Not all stores provide recyclable packaging, but many do, and the numbers vary across cities. The analysis here shows how recyclable packaging is distributed across cities.

The visualization below shows, for each city, how many purchases did and did not involve recyclable packaging. Moreover, prior to this research we believed there might not be many stores providing recyclable packaging, as the data is quite old. Surprisingly, recyclable packaging is more common than not.

In [12]:
#Purchase records with recyclable packaging in each city
number_of_recyclable_packages_per_city = (data["recyclable_package"]).groupby("store_city").value_counts()[:,1]
number_of_recyclable_packages_per_city
Out[12]:
store_city
Bellingham        426
Beverly Hills    2280
Bremerton        1971
Los Angeles      2208
Portland         2893
Salem            3034
San Francisco     430
Seattle          2813
Spokane          2449
Tacoma           3158
Name: recyclable_package, dtype: int64
In [13]:
#Purchase records without recyclable packaging in each city
number_of_non_recyclable_packages_per_city = (data["recyclable_package"]).groupby("store_city").value_counts()[:,0]
number_of_non_recyclable_packages_per_city
Out[13]:
store_city
Bellingham        285
Beverly Hills    1871
Bremerton        1480
Los Angeles      1752
Portland         2257
Salem            2444
San Francisco     353
Seattle          2238
Spokane          2004
Tacoma           2546
Name: recyclable_package, dtype: int64
In [14]:
#Plotting recyclable-packaging counts per city
city_analysis = sns.barplot(y=number_of_recyclable_packages_per_city.index, x = number_of_recyclable_packages_per_city.values)
city_analysis.set( ylabel = 'City', xlabel='Number of Records')
city_analysis.set_title('Purchases with recyclable packaging by city')
plt.show()

Observation: Most stores in the cities of Washington State are environmentally inclined in their packaging, barring Bellingham. Furthermore, although only a few cities in Oregon host these retail stores, a significant proportion of purchases there use sustainable bags. Conversely, stores in the urban regions of California show the lowest use of recyclable bags.

Inference: Stores in the cities of California, particularly San Francisco, can partner with local manufacturers of sustainable packaging to enable economical and efficient mass production. Additionally, they can introduce incentives such as substantial discounts for customers who bring their own eco-friendly bags from home.

4.2. Most profitable stores at different locations in United States¶

Here, we analyse the average profit obtained by the stores in various states and cities of the United States and draw a few inferences from the available data.

4.2.1. Deriving average profits:¶

First, we compute the profit (store sales minus store cost) for each record across the cities of the various states and store it as a new column in the dataframe.

In [15]:
# yield the profit column
data['Profit'] = data['store_sales(in millions)'] - data['store_cost(in millions)']

4.2.2. Obtaining the average store size¶

Next, we find the average size of the stores, along with the average shelf space allotted to the various sections, for each type of store.

In [16]:
# Calculating the average store size and section areas per store type
df_avgStoreSize = data[['store_type','store_sqft', 'grocery_sqft', 'frozen_sqft', 'meat_sqft']].groupby(by = 'store_type').agg(func = 'mean').round(3)

4.2.3. Finding the correlations:¶

Initially, let's try to find any relationships between the average profits and any other factors that may influence the profits.

In [17]:
# retrieve variables which may have correlation with profit
df_ProfitVariables = data[['Profit','unit_sales(in millions)', 'store_type', 'store_sqft', 'brand_name', 'grocery_sqft', 'frozen_sqft', 'meat_sqft']]
df_ProfitVariables.corr().round(3)
Out[17]:
Profit unit_sales(in millions) store_sqft grocery_sqft frozen_sqft meat_sqft
Profit 1.000 0.496 0.007 -0.007 0.023 0.023
unit_sales(in millions) 0.496 1.000 0.020 -0.006 0.050 0.050
store_sqft 0.007 0.020 1.000 0.923 0.867 0.867
grocery_sqft -0.007 -0.006 0.923 1.000 0.609 0.609
frozen_sqft 0.023 0.050 0.867 0.609 1.000 1.000
meat_sqft 0.023 0.050 0.867 0.609 1.000 1.000

Examining further...¶

In [18]:
# average profits based on store type
profitByStoreType = df_ProfitVariables.groupby(by = 'store_type').agg(func = 'mean')['Profit']
print('The profit of stores based on the store type: ',profitByStoreType)
The profit of stores based on the store type:  store_type
Deluxe Supermarket     3.944830
Gourmet Supermarket    3.973857
Small Grocery          2.051111
Supermarket            4.004565
Name: Profit, dtype: float64
In [19]:
def createBarPlot(xArray, yArray, title = None, xLable = None, yLabel = None, xRotation = 60):
    '''
    Create a formatted vertical barplot.
    '''
    barPlotObj = sns.barplot(x= xArray, y = yArray, errwidth = 2)
    if not title is None:
        barPlotObj.set_title(title)
    if not xLable is None:
        barPlotObj.set(xlabel=xLable)
    if not yLabel is None:
        barPlotObj.set(ylabel = yLabel)
    barPlotObj.set_ylim(top = max(yArray) * 1.1)
           
    for bar, yVal in zip(barPlotObj.patches, yArray):
        # add text for each bar
        text_x = bar.get_x() + bar.get_width() / 2.0
        text_y = bar.get_height()        
        text = f'{yVal:.3}'
        barPlotObj.text(text_x, text_y, text, fontsize=11, ha='center', va='bottom')
    plt.xticks(rotation=xRotation)
    return barPlotObj


def createBarPlotHorizontal( labelArr,valArr, title = None, xLable = None, yLabel = None):
    '''
    Create a formatted horizontal barplot.
    '''
    barPlotObj = sns.barplot(x= valArr, y = labelArr, errwidth = 2)
    if not title is None:
        barPlotObj.set_title(title)
    if not xLable is None:
        barPlotObj.set(xlabel=xLable)
    if not yLabel is None:
        barPlotObj.set(ylabel = yLabel)
    # formatting
    barPlotObj.set_xlim(0, valArr.values.max() * 1.1)
    for bar, yVal in zip(barPlotObj.patches, valArr.values):
        # add text for each bar
        text_x = bar.get_x() + bar.get_width() + .2
        text_y = bar.get_y() + bar.get_height() / 2
        text = f'{yVal:.3}'
        barPlotObj.text(text_x, text_y, text, fontsize=11, ha='center', va='bottom')
In [20]:
createBarPlotHorizontal(profitByStoreType.index, profitByStoreType, 'Average Profit by Store Type (in million)', 'Store Type', 'Average Profit')

Observation: On average, Small Grocery stores make the least profit, while the three supermarket formats are roughly comparable, with plain Supermarkets slightly ahead.

4.2.4. Average profits for stores by cities:¶

Further, let us understand if our analysis still holds for different types of stores in various cities in the United States.

In [21]:
#Average profits in cities based on store type
profitByCityAndStoreType = df_ProfitVariables.groupby(by = ['store_city', 'store_type']).agg(func = 'mean')['Profit'].sort_values(ascending=False)
print(profitByCityAndStoreType)
store_city     store_type         
Portland       Supermarket            4.048342
Los Angeles    Supermarket            4.002658
Bremerton      Supermarket            4.000219
Seattle        Supermarket            3.989026
Tacoma         Deluxe Supermarket     3.977632
Spokane        Supermarket            3.976626
Beverly Hills  Gourmet Supermarket    3.973857
Salem          Deluxe Supermarket     3.910675
Bellingham     Small Grocery          2.084581
San Francisco  Small Grocery          2.020719
Name: Profit, dtype: float64
In [22]:
createBarPlotHorizontal(profitByCityAndStoreType.index.values, profitByCityAndStoreType, 'Average Profit by City and Store Type (in million)', 'City', 'Average Profit')

Observation: From the above data, we can observe that Supermarket stores, which have a relatively smaller area than Deluxe Supermarket stores, make the highest profits across cities.

4.2.5. Average profits for stores by cities and state:¶

Drilling further into the data, let us verify whether our previous observation still holds at the state level.

In [23]:
profitByStateCityAndStoreType =df_ProfitVariables.groupby(by = ['store_state', 'store_city', 'store_type']).agg(func = 'mean')['Profit'].sort_index(level = 0, ascending= False)
In [24]:
createBarPlotHorizontal(profitByStateCityAndStoreType.index.values, profitByStateCityAndStoreType, \
        'Average Profit by State, City, and Store Type (in million)')

Observation: From the data visualization, we can observe that the previous observation on supermarkets making better profits still holds. From the analysis we can further infer that:

  • In Washington (WA), the Supermarket in Bremerton shows the highest average profit, which suggests that the population of Bremerton might be greater than that of the other cities.
  • The Supermarkets in Spokane, Seattle, and Bremerton could be expanded and upgraded to Deluxe Supermarkets (space permitting) to achieve more profit growth.
  • In Oregon, supermarkets earn more profit than deluxe supermarkets. Planograms, product availability, and better offers might be contributing reasons, along with store placement.

4.2.6. Analysing the top premium products in a store based on the location.¶

The goal is to analyse store attributes by location. SRP (suggested retail price) is the price set by manufacturers at which retailers are supposed to sell items in the market. So, we determine the average price of a product in each state.

In [25]:
#Calculating the premium products in stores across various cities in multiple states in the United States
df_store_counts = pd.DataFrame(data.groupby(by = ['store_state','store_city','brand_name']).agg({'SRP':'mean'}))
In [26]:
#sorting and ordering the data 
df_store_counts_sorted = df_store_counts['SRP'].groupby(['store_state','store_city'], group_keys=False)
top_store_srp = pd.DataFrame(df_store_counts_sorted.apply(lambda x:x.sort_values(ascending=False).head(5)))
top_store_srp = top_store_srp.reset_index()
top_store_srp.head()
Out[26]:
store_state store_city brand_name SRP
0 CA Beverly Hills Gauss 3.289412
1 CA Beverly Hills James Bay 3.218182
2 CA Beverly Hills Colossal 3.005833
3 CA Beverly Hills Medalist 2.963158
4 CA Beverly Hills Quick 2.953333
In [27]:
#Visualizing various premium brands at different stores in the United States
fig = px.bar(top_store_srp, x="brand_name", color="store_state",
             y='SRP',
             title="",
             barmode='group',
             height=700,
             facet_col="store_state"
            )
fig.update_traces(width=1)
fig.show()

Observation - From the graphs, it can be observed that not all brands' products are sold in all the states.
It can also be observed that the average price set by manufacturers varies across the states, and that Washington has the highest average SRP across its stores.

Inference - Since product prices are highest in Washington, compared with the prices in Oregon and California, and the maximum profits are being driven from Washington, the brands could raise their prices in the other states to gain better profitability. Furthermore, some brands need much better marketing and sales strategies to establish themselves among the top.

4.3. Analysis on store promotion strategies¶

The given dataset records several promotion strategies that customers redeem while shopping. These promotion strategies vary by city and state. Each promotion strategy is associated with the media type used to promote the coupons and the cost of the advertisement. We analyze which promotion type is popular in each city and, based on city properties, infer which promotion types could be extended to other cities. We have data for three states (Washington, California, and Oregon) and a few cities within them.

In [28]:
# Finding the promotion count based on state and city
count = data.groupby([data.index,'promotion_name']).count()
count.reset_index(inplace = True)
count[['store_state', 'store_city']] = pd.DataFrame(count['level_0'].tolist(), index=count.index)
count = count[['store_state','store_city', 'promotion_name', 'food_category']]
count.rename({'food_category' : 'promotion_count'}, axis = 'columns', inplace = True)
# Here, the count dataframe stores all the count of various promotions used by users across cities.
count.head()
Out[28]:
store_state store_city promotion_name promotion_count
0 CA Beverly Hills Best Savings 174
1 CA Beverly Hills Big Promo 151
2 CA Beverly Hills Big Time Savings 80
3 CA Beverly Hills Bye Bye Baby 115
4 CA Beverly Hills Double Down Sale 127
In [29]:
# Finding the most used promotion
count.loc[count['promotion_count'].idxmax()]
Out[29]:
store_state                           OR
store_city                         Salem
promotion_name     Cash Register Lottery
promotion_count                     1555
Name: 82, dtype: object
In [30]:
import plotly.express as px

# Plotting a sunburst (donut) chart of promotion types across states and cities
fig = px.sunburst(count, path=['store_state', 'store_city', 'promotion_name'], values='promotion_count',
                  hover_data=['promotion_count'])
fig.update_layout(
    autosize=False,
    width=800,
    height=800,
    title="Promotions used across States and Cities<br><sup>(Hover over the components to access the promotion count)</sup>")
fig.show()

The hierarchical sunburst (donut) chart displayed above shows that promotions are most heavily used in Washington, followed by Oregon and then California. The following table shows which promotion types are most frequently used in the various cities -

Sr No.  State       City           Promotion Type         Promotion Count
1       Oregon      Salem          Cash Register Lottery  1555
2       Washington  Seattle        Weekend Markdown       904
3       Oregon      Portland       Free For All           645
4       California  Los Angeles    High Roller Savings    639
5       Washington  Tacoma         Super Duper Savers     622
6       Washington  Spokane        Sales Days             512
7       California  Beverly Hills  Two Day Sale           482
8       Washington  Bremerton      Shelf Clearing Days    445
9       California  San Francisco  Savings Galore         86
10      Washington  Bellingham     Coupon Spectacular     64

Inference: Cash Register Lottery is the most used promotion in Salem, yet it is not used in any other city. Seattle, Tacoma, and Spokane, for example, have huge promotion-driven customer interaction, but Cash Register Lottery is not offered there. Extending this promotion to those cities could make their stores more profitable. The reverse also applies: Weekend Markdown could be pushed to cities in Oregon and California. In general, the leading promotion types can be rolled out to other cities so that customers redeem maximum savings and stores make maximum profits.
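The per-city leaders shown in the table can be recomputed from the `count` dataframe built earlier; below is a minimal sketch on a toy stand-in whose columns mirror `count` (the rows here are a small subset for illustration):

```python
import pandas as pd

# toy stand-in for the `count` dataframe built above
count = pd.DataFrame({
    'store_state':     ['OR', 'OR', 'WA', 'WA', 'CA'],
    'store_city':      ['Salem', 'Salem', 'Seattle', 'Seattle', 'Los Angeles'],
    'promotion_name':  ['Cash Register Lottery', 'Big Promo',
                        'Weekend Markdown', 'Sales Days', 'High Roller Savings'],
    'promotion_count': [1555, 151, 904, 512, 639],
})

# for each city, keep the row with the highest promotion_count,
# then rank the cities by that count
top_per_city = count.loc[count.groupby('store_city')['promotion_count'].idxmax()]
top_per_city = top_per_city.sort_values('promotion_count', ascending=False)
print(top_per_city)
```

On the full `count` dataframe this yields one leading promotion per city, which is exactly the shape of the table above.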

4.4. Customer behavior based on gender and marital status and which food category to promote?¶

This dataset contains sufficient features to analyze customers and their shopping preferences, for example, which customers (with or without kids) tend to buy a store membership. By looking at the relevant features, we derive insights and inferences on customer behavior. We also extend this analysis by probing two pairs of relationships: one between food category and store location, and the other between food category and marital status. Here, we analyze how consumer food preferences vary by location, which food types are popular, and, based on customer marital status, which food categories can be promoted to increase sales.

In [31]:
#Function to determine whether a customer has children
def children(value):
    if value >= 1:
        return "Y"
    else:
        return "N"
    
#Creating a new dataframe column for above observation
data['children'] = data['total_children'].map(children)
display(data.head())
food_category food_department food_family store_sales(in millions) store_cost(in millions) unit_sales(in millions) promotion_name marital_status gender total_children ... meat_sqft coffee_bar video_store salad_bar prepared_food florist media_type cost Profit children
store_state store_city
WA Bellingham Breakfast Foods Frozen Foods Food 3.28 1.3776 2 Coupon Spectacular S F 1 ... 2374 1 0 0 0 0 Daily Paper 90.58 1.9024 Y
Bellingham Breakfast Foods Frozen Foods Food 1.36 0.4488 1 Super Savers M F 4 ... 2374 1 0 0 0 0 TV 114.98 0.9112 Y
Bellingham Breakfast Foods Frozen Foods Food 1.87 0.6171 1 Sales Galore M M 1 ... 2374 1 0 0 0 0 Daily Paper, Radio 109.03 1.2529 Y
Bellingham Bread Baked Goods Food 3.93 1.8078 1 Double Your Savings S M 4 ... 2374 1 0 0 0 0 Street Handout 122.16 2.1222 Y
Bellingham Bread Baked Goods Food 1.06 0.5194 2 Wallet Savers M F 3 ... 2374 1 0 0 0 0 TV 139.36 0.5406 Y

5 rows × 34 columns

In [32]:
# getting the count of the marital status
data.marital_status.value_counts()
Out[32]:
S    19793
M    19099
Name: marital_status, dtype: int64
In [33]:
# getting the count of the gender value
data.gender.value_counts()
Out[33]:
M    19692
F    19200
Name: gender, dtype: int64
In [34]:
# getting the count of the customers with and without kids
data.children.value_counts()
Out[34]:
Y    35307
N     3585
Name: children, dtype: int64
In [35]:
# getting the count of number of kids
data['total_children'].value_counts()
Out[35]:
1    8015
4    7982
2    7875
3    7730
5    3705
0    3585
Name: total_children, dtype: int64
In [36]:
#To obtain new dataframe for demographics
data_customer_behavior = data.groupby(by=['gender','marital_status','children'])
In [37]:
#A detailed overview of our focused groups
for name, group in data_customer_behavior:
    print(name, len(group))
('F', 'M', 'N') 910
('F', 'M', 'Y') 8700
('F', 'S', 'N') 713
('F', 'S', 'Y') 8877
('M', 'M', 'N') 932
('M', 'M', 'Y') 8557
('M', 'S', 'N') 1030
('M', 'S', 'Y') 9173
In [38]:
#creating a dictionary
demographic_dictionary = {'Category':[], 'Occurence':[]}

#Appending the dictionary
for name,group in data_customer_behavior:
    demographic_dictionary['Category'].append(str(name))
    demographic_dictionary['Occurence'].append(len(group))
In [39]:
#Creating a dataframe for visualization
df_of_customer = pd.DataFrame(demographic_dictionary)
df_of_customer
Out[39]:
Category Occurence
0 ('F', 'M', 'N') 910
1 ('F', 'M', 'Y') 8700
2 ('F', 'S', 'N') 713
3 ('F', 'S', 'Y') 8877
4 ('M', 'M', 'N') 932
5 ('M', 'M', 'Y') 8557
6 ('M', 'S', 'N') 1030
7 ('M', 'S', 'Y') 9173
In [40]:
#creating a virtual plot
import plotly.express as px
fig = px.bar(df_of_customer, x="Category", y="Occurence", color="Occurence", barmode="group", title="Distribution of customers based on number of children and marital status <br> (F=Female, M=Male, N=No children, Y=At least one child, S=Single)")
fig.show()

Observation - From the plot, it can be observed that, irrespective of marital status and gender, the largest segment of visiting customers is the one with children.
Furthermore, there is a massive difference between customers with children and customers without. Within the segments of customers with children, married customers are slightly fewer than single ones.
Inference - For the demographics analysis, three variables were used: gender, marital status, and whether the customer has children.
It can be inferred that the retail store hosts mostly customers with children, and business strategies should be shaped by this observation. It is arguably the most important insight a retail chain can have, as children have considerable buying power of their own and also influence their parents' purchasing decisions.

4.5. Customer yearly income analysis based on customer features¶

In the following snippets we want to see whether one customer feature can be predicted from another set of customer features. The first feature we wish to predict is the customer's yearly income. Predicting yearly income is of great commercial value because, with such a predictive model, it becomes easy to group customers into income brackets, to which we could apply customized and more effective promotion strategies. For example, in the case of Macy's, when a customer applies for an exclusive store membership, the customer's financial profile is scanned before deciding whether the customer is eligible. Similarly, predicting yearly income from customer features will help management simplify data collection and better assess customer eligibility for exclusive membership.

In [41]:
def dfMultiMergeByIndex(dfArr):
    '''
    merge multiple dataframes based on their index columns
    '''
    isFirst = True
    res = None
    for df in dfArr:
        if isFirst:
            res = df
            isFirst = False
        else:
            res = pd.merge(res, df, right_index=True, left_index=True)
    return res
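As a usage note, the same multi-way index merge that `dfMultiMergeByIndex` performs can be expressed more compactly with `functools.reduce`; below is a minimal sketch on hypothetical toy frames (not our dataset):

```python
from functools import reduce
import pandas as pd

# hypothetical toy frames sharing the same default index
df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'b': [3, 4]})
df3 = pd.DataFrame({'c': [5, 6]})

# equivalent to dfMultiMergeByIndex([df1, df2, df3]): fold pd.merge over the list
merged = reduce(lambda left, right: pd.merge(left, right, left_index=True, right_index=True),
                [df1, df2, df3])
print(list(merged.columns))  # → ['a', 'b', 'c']
```

Both approaches keep only rows whose index appears in every frame, which is what we want here since all frames come from the same source dataframe.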
In [42]:
# fetching only customer related features
customerData = data.iloc[:, 7:17].reset_index(drop= True)  # retrieve all the columns that represent customer features
customerData.head()
Out[42]:
marital_status gender total_children education member_card occupation houseowner avg_cars_at home(approx) avg. yearly_income num_children_at_home
0 S F 1 Bachelors Degree Bronze Professional Y 2 $50K - $70K 0
1 M F 4 High School Degree Bronze Manual Y 4 $30K - $50K 4
2 M M 1 Partial High School Normal Skilled Manual Y 1 $10K - $30K 0
3 S M 4 Graduate Degree Bronze Professional N 2 $70K - $90K 0
4 M F 3 Bachelors Degree Golden Management Y 2 $50K - $70K 3
In [43]:
# convert binary variables into 0-1 variables
customerData.marital_status = customerData.marital_status.map(lambda v : 0 if v =='S' else 1) 
customerData.gender = customerData.gender.map(lambda v : 0 if v =='F' else 1) 
customerData.houseowner = customerData.houseowner.map(lambda v : 0 if v =='N' else 1) 

# convert categorical vars into dummy vars
eduDummy = pd.get_dummies(customerData.education)
memDummy = pd.get_dummies(customerData.member_card)
ocuDummy = pd.get_dummies(customerData.occupation)

yearlyIncome = customerData['avg. yearly_income']
customerDataKNN = customerData.drop(columns = ['education', 'member_card', 'occupation', 'avg. yearly_income'])
customerDataKNN = dfMultiMergeByIndex([customerDataKNN, eduDummy, memDummy, ocuDummy])
customerDataKNN.head(5)
Out[43]:
marital_status gender total_children houseowner avg_cars_at home(approx) num_children_at_home Bachelors Degree Graduate Degree High School Degree Partial College Partial High School Bronze Golden Normal Silver Clerical Management Manual Professional Skilled Manual
0 0 0 1 1 2 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0
1 1 0 4 1 4 4 0 0 1 0 0 1 0 0 0 0 0 1 0 0
2 1 1 1 1 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1
3 0 1 4 0 2 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0
4 1 0 3 1 2 3 1 0 0 0 0 0 1 0 0 0 1 0 0 0
In [44]:
xArr = customerDataKNN
yArr = yearlyIncome
# split the dataset into training and testing subsets
xTrainKNN, xTestKNN, yTrainKNN, yTestKNN = train_test_split(
    xArr.values, yArr.values, random_state=11)  
# defining the KNN classifier for yearly income prediction
knnEstimatorIncome = KNeighborsClassifier()
knnEstimatorIncome.fit(X=xTrainKNN, y=yTrainKNN)
Out[44]:
KNeighborsClassifier()
In [45]:
predictedKNN = knnEstimatorIncome.predict(X=xTestKNN)
expectedKNN = yTestKNN

# getting the accuracy of the prediction model
print(f'The accuracy of the prediction model is: {knnEstimatorIncome.score(xTestKNN, yTestKNN):.2%}')
The accuracy of the prediction model is: 90.80%
In [46]:
predAndExpected = [(x,y) for x,y in zip(predictedKNN, expectedKNN)]
sampleResult = random.sample(predAndExpected, 10)
# running the prediction model on the testing data
print('Testing the prediction model on the testing data -')
for x,y in sampleResult:
    print(f'predicted: {x}' , '  \t', f'expected: {y}')
Testing the prediction model on the testing data -
predicted: $50K - $70K   	 expected: $50K - $70K
predicted: $70K - $90K   	 expected: $70K - $90K
predicted: $70K - $90K   	 expected: $50K - $70K
predicted: $70K - $90K   	 expected: $70K - $90K
predicted: $10K - $30K   	 expected: $10K - $30K
predicted: $50K - $70K   	 expected: $50K - $70K
predicted: $10K - $30K   	 expected: $10K - $30K
predicted: $50K - $70K   	 expected: $50K - $70K
predicted: $30K - $50K   	 expected: $30K - $50K
predicted: $30K - $50K   	 expected: $50K - $70K
In [47]:
yearlyIncome.value_counts().plot(kind = 'bar', figsize = (10,6), rot = 0, title = 'Histogram of Customer Income Group', xlabel = 'Annual income group', ylabel = 'Number of customers')
Out[47]:
<AxesSubplot:title={'center':'Histogram of Customer Income Group'}, xlabel='Annual income group', ylabel='Number of customers'>
Interpretation of the KNN Model for Predicting Customer Yearly Income¶

The model achieves a relatively high accuracy (about 91%), so we consider it a valid model. Given this, we can conclude that a customer's yearly income can be predicted from his/her marital status, gender, total children, education, member card, occupation, home ownership, average cars at home, and number of children at home.

Suggestion to the management: Based on this model, the management could understand customers well before having any information on their annual income. Using this, they can target consumers with higher yearly incomes with exclusive features like free home delivery, same-day home delivery, etc. By applying this strategy, management will be in a position to lock in these customers for a longer period once they buy the exclusive subscription.

Visualization of the KNN Model¶

To display our data on a three-dimensional graph, we perform principal component analysis (PCA) to reduce the high-dimensional input data to a set of 3-D vectors. We then draw scatter plots for both the predicted and the expected values.

In [48]:
priCompAnalyzer3D = decomposition.PCA(3)
priCompAnalyzer3D.fit(xTrainKNN)
Out[48]:
PCA(n_components=3)
In [49]:
xTestKNN3D = priCompAnalyzer3D.transform(xTestKNN)
xTestKNN3D.shape
Out[49]:
(9723, 3)
In [50]:
knnResult3D = pd.DataFrame(xTestKNN3D, columns = ['x', 'y', 'z'])
knnResult3D['predicted'] = predictedKNN
knnResult3D['expected'] = expectedKNN


incomeGroupToColor = {income : color for income, color in zip(np.unique(knnResult3D.predicted), ['c', 'b', 'g', 'r', 'm', 'y', 'k', 'w'])}
knnResult3D['colorPredicted'] = knnResult3D.predicted.map(incomeGroupToColor)
knnResult3D['colorExpected'] = knnResult3D.expected.map(incomeGroupToColor)


rowSampler = random.sample([i for i in range(len(knnResult3D))], 1000)
knnResult3DSample = knnResult3D.iloc[rowSampler, :]
In [51]:
px.scatter_3d(data_frame= knnResult3DSample, x = 'x', y = 'y', z = 'z', color= 'predicted', title = 'Predicted Customer Income Group')
In [52]:
px.scatter_3d(data_frame= knnResult3DSample, x = 'x', y = 'y', z = 'z', color= 'expected', title = 'Expected Customer Income Group')

4.6. Prediction of Customer Acquisition Cost using customer details¶

In order to predict the cost of acquiring customers, we look at the different kinds of customers we have and use their features as predictors. Our hope is to find some dependence of the customer acquisition cost on the various features of the customers. These features include their marital status, gender, the number of children they have, their occupation, yearly income range, education level, whether they own their home, and what kind of loyalty card they have.

By looking at these features, we hypothesize that there may be some relation between them and the corresponding customers' cost of acquisition. At the moment, we do not know whether such a relation exists and, if it does, how it impacts our costs. Since we are focusing exclusively on customer details, we will filter our original dataframe down to the variables relevant to customer details. Following that, we will create dummy variables for each categorical variable. This means that, for each category of a categorical variable, there will be a column indicating whether the row belongs to that category or not.
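As a minimal illustration of the dummy-variable step (on a hypothetical toy column, not our full dataset), `pd.get_dummies` expands each category into its own indicator column; with `drop_first=True` the first category is dropped to avoid perfect multicollinearity:

```python
import pandas as pd

# hypothetical toy data with one categorical column
toy = pd.DataFrame({'member_card': ['Bronze', 'Normal', 'Golden', 'Silver', 'Bronze']})

# one indicator column per category; drop_first=True removes the baseline
# category (here 'Bronze', the first in sorted order)
dummies = pd.get_dummies(toy, columns=['member_card'], drop_first=True)
print(list(dummies.columns))
# → ['member_card_Golden', 'member_card_Normal', 'member_card_Silver']
```

A row of all zeros in these columns then implicitly encodes the dropped baseline category.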

We then attempt to predict the customer acquisition costs by creating three regression models. The three models are:

1) Ordinary Least Squares (Linear Regression)
2) Decision Tree Regressor
3) XGBoost Regressor

In [53]:
#creating a list of customer features
customer_columns = ['marital_status', 'gender',
       'total_children', 'education', 'occupation',
       'houseowner', 'avg_cars_at home(approx)', 'avg. yearly_income', 'member_card', 'cost']
In [54]:
#creating a new data frame for customers
df_customer = data[customer_columns].copy(deep= True)
In [55]:
#identifying categorical variables for making dummy variables
categorical_columns = ['marital_status', 'gender', 'education', 'occupation',
       'houseowner', 'avg. yearly_income', 'member_card']
In [56]:
#showing all categories of all categorical variables
val_dict = {}
for col in categorical_columns:
    val_dict.update({col : df_customer[col].value_counts().index})
In [57]:
val_dict
Out[57]:
{'marital_status': Index(['S', 'M'], dtype='object'),
 'gender': Index(['M', 'F'], dtype='object'),
 'education': Index(['High School Degree', 'Partial High School', 'Bachelors Degree',
        'Partial College', 'Graduate Degree'],
       dtype='object'),
 'occupation': Index(['Professional', 'Skilled Manual', 'Manual', 'Management', 'Clerical'], dtype='object'),
 'houseowner': Index(['Y', 'N'], dtype='object'),
 'avg. yearly_income': Index(['$30K - $50K', '$10K - $30K', '$50K - $70K', '$70K - $90K',
        '$130K - $150K', '$110K - $130K', '$90K - $110K', '$150K +'],
       dtype='object'),
 'member_card': Index(['Bronze', 'Normal', 'Golden', 'Silver'], dtype='object')}
In [58]:
#creating dummy variables for each categorical variable
df_customer = pd.get_dummies(df_customer, columns= categorical_columns, drop_first= True)
df_customer.head()
Out[58]:
total_children avg_cars_at home(approx) cost marital_status_S gender_M education_Graduate Degree education_High School Degree education_Partial College education_Partial High School occupation_Management ... avg. yearly_income_$110K - $130K avg. yearly_income_$130K - $150K avg. yearly_income_$150K + avg. yearly_income_$30K - $50K avg. yearly_income_$50K - $70K avg. yearly_income_$70K - $90K avg. yearly_income_$90K - $110K member_card_Golden member_card_Normal member_card_Silver
store_state store_city
WA Bellingham 1 2 90.58 1 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
Bellingham 4 4 114.98 0 0 0 1 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
Bellingham 1 1 109.03 0 1 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 1 0
Bellingham 4 2 122.16 1 1 1 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
Bellingham 3 2 139.36 0 0 0 0 0 0 1 ... 0 0 0 0 1 0 0 1 0 0

5 rows × 24 columns

In [59]:
#Seperating the independent and dependent variables
X = df_customer.drop(columns= ['cost'])
X = sm.add_constant(X)                        #adding a constant to the independent variable to get intercept of linear regression line
y = df_customer['cost']
In [60]:
#Linear Regression / OLS model
ols_model = sm.OLS(y, X)
ols_results = ols_model.fit()
ols_results.summary()
Out[60]:
OLS Regression Results
Dep. Variable: cost R-squared: 0.003
Model: OLS Adj. R-squared: 0.003
Method: Least Squares F-statistic: 5.930
Date: Mon, 05 Dec 2022 Prob (F-statistic): 4.70e-18
Time: 19:57:36 Log-Likelihood: -1.8795e+05
No. Observations: 38892 AIC: 3.760e+05
Df Residuals: 38868 BIC: 3.762e+05
Df Model: 23
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 102.2828 1.645 62.171 0.000 99.058 105.507
total_children -0.0619 0.108 -0.574 0.566 -0.273 0.149
avg_cars_at home(approx) 0.8919 0.172 5.179 0.000 0.554 1.229
marital_status_S 0.6103 0.330 1.848 0.065 -0.037 1.257
gender_M -0.5850 0.310 -1.890 0.059 -1.192 0.022
education_Graduate Degree -0.2079 0.798 -0.261 0.794 -1.771 1.355
education_High School Degree -3.3260 0.688 -4.834 0.000 -4.675 -1.977
education_Partial College -0.5647 0.732 -0.772 0.440 -1.999 0.870
education_Partial High School -2.1937 0.730 -3.006 0.003 -3.624 -0.763
occupation_Management -2.5200 1.411 -1.786 0.074 -5.286 0.246
occupation_Manual -2.6514 1.419 -1.868 0.062 -5.433 0.130
occupation_Professional -3.9637 1.391 -2.849 0.004 -6.691 -1.237
occupation_Skilled Manual -2.6412 1.417 -1.865 0.062 -5.418 0.135
houseowner_Y 0.6220 0.336 1.853 0.064 -0.036 1.280
avg. yearly_income_$110K - $130K -2.3128 1.248 -1.853 0.064 -4.759 0.134
avg. yearly_income_$130K - $150K -2.2188 1.208 -1.836 0.066 -4.587 0.150
avg. yearly_income_$150K + 0.3822 1.510 0.253 0.800 -2.578 3.343
avg. yearly_income_$30K - $50K -0.6120 0.890 -0.688 0.491 -2.356 1.132
avg. yearly_income_$50K - $70K -1.8788 0.980 -1.917 0.055 -3.800 0.042
avg. yearly_income_$70K - $90K -2.9283 1.061 -2.760 0.006 -5.008 -0.849
avg. yearly_income_$90K - $110K 1.6155 1.262 1.280 0.201 -0.858 4.089
member_card_Golden -0.2953 0.529 -0.559 0.576 -1.331 0.741
member_card_Normal -0.7922 0.683 -1.161 0.246 -2.130 0.546
member_card_Silver -1.2319 0.580 -2.124 0.034 -2.369 -0.095
Omnibus: 836571.603 Durbin-Watson: 1.854
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2941.472
Skew: 0.005 Prob(JB): 0.00
Kurtosis: 1.653 Cond. No. 77.5


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [61]:
#creating a function for VIF analysis and variable selection
def get_vif(train_dataframe):
    
    vif_dict, tolerance_dict = {}, {}

    # create formula for each exogenous variable
    for col in train_dataframe.drop(columns = ['const']) :
        y = train_dataframe[col]
        X = train_dataframe.drop(columns = [col])
        

        # extract r-squared from the fit
        r_squared = sm.OLS(y,X).fit().rsquared

        # calculate VIF
        vif = 1/(1 - r_squared)
        vif_dict[col] = vif

        # calculate tolerance
        tolerance = 1 - r_squared
        tolerance_dict[col] = tolerance

    # return VIF DataFrame
    df_vif = pd.DataFrame({'VIF': vif_dict, 'Tolerance': tolerance_dict})

    return df_vif
In [62]:
get_vif(X)
Out[62]:
VIF Tolerance
total_children 1.076095 0.929286
avg_cars_at home(approx) 1.602988 0.623835
marital_status_S 1.147374 0.871556
gender_M 1.008868 0.991210
education_Graduate Degree 1.337053 0.747913
education_High School Degree 4.168448 0.239897
education_Partial College 1.823190 0.548489
education_Partial High School 4.687363 0.213340
occupation_Management 10.426888 0.095906
occupation_Manual 15.845654 0.063109
occupation_Professional 18.034810 0.055448
occupation_Skilled Manual 16.212233 0.061682
houseowner_Y 1.129459 0.885380
avg. yearly_income_$110K - $130K 2.645354 0.378021
avg. yearly_income_$130K - $150K 3.457997 0.289185
avg. yearly_income_$150K + 1.989750 0.502576
avg. yearly_income_$30K - $50K 7.280988 0.137344
avg. yearly_income_$50K - $70K 5.698303 0.175491
avg. yearly_income_$70K - $90K 5.134349 0.194767
avg. yearly_income_$90K - $110K 2.661621 0.375711
member_card_Golden 1.257513 0.795221
member_card_Normal 3.556942 0.281140
member_card_Silver 1.130706 0.884403
In [63]:
ols_predictions = ols_results.fittedvalues
In [64]:
#plotting actual vs predicted values 
fig = sns.scatterplot(x = df_customer['cost'], y= ols_predictions)
plt.xlabel('Actual value')
plt.ylabel('Predicted value')
plt.title('Actual vs Predicted values of OLS model')
Out[64]:
Text(0.5, 1.0, 'Actual vs Predicted values of OLS model')

4.6.1 Ordinary Least Squares Model Performance¶

The Ordinary Least Squares model is the simplest model we have used and, unfortunately, it has performed worse than we expected. The results show an R squared value of 0.003, a value that should ideally be close to 1. This means the model has done a really poor job of explaining the variation in the data. Apart from this, many other metrics, such as the Omnibus statistic and the F-statistic, are far from their ideal values.

This means that a model as simple as linear regression won't work for our data. This is probably because there are a lot of categorical variables in the data, and linear regression works best with numerical data. Another reason for such a low score may be high correlation between some predictor variables. To check for this, we use VIF analysis and see that the 'occupation' dummies have too large a VIF score. However, even after dropping that variable, there is virtually no difference in the results, so we have chosen not to show the second iteration of the OLS model.

In hopes of improving our model, we now move to the Decision Tree Regressor, which is a bit more complex than linear regression.

In [65]:
#dropping the additional 'const' column
X.drop(columns = ['const'], inplace= True)
X.head()
Out[65]:
total_children avg_cars_at home(approx) marital_status_S gender_M education_Graduate Degree education_High School Degree education_Partial College education_Partial High School occupation_Management occupation_Manual ... avg. yearly_income_$110K - $130K avg. yearly_income_$130K - $150K avg. yearly_income_$150K + avg. yearly_income_$30K - $50K avg. yearly_income_$50K - $70K avg. yearly_income_$70K - $90K avg. yearly_income_$90K - $110K member_card_Golden member_card_Normal member_card_Silver
store_state store_city
WA Bellingham 1 2 1 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
Bellingham 4 4 0 0 0 1 0 0 0 1 ... 0 0 0 1 0 0 0 0 0 0
Bellingham 1 1 0 1 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 1 0
Bellingham 4 2 1 1 1 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
Bellingham 3 2 0 0 0 0 0 0 1 0 ... 0 0 0 0 1 0 0 1 0 0

5 rows × 23 columns

In [66]:
#train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.25)
In [67]:
#DecisionTree Regressor
from sklearn.tree import DecisionTreeRegressor
dt_reg = DecisionTreeRegressor(criterion= 'mse')
dt_reg.fit(X_train, y_train)
Out[67]:
DecisionTreeRegressor(criterion='mse')
In [68]:
#making an array of predictions
y_pred = dt_reg.predict(X_test)
In [69]:
from sklearn.metrics import r2_score, mean_squared_error
In [70]:
#printing the metrics of DecisionTree model
print(f'R2 score: {r2_score(y_test, y_pred): 0.3f}')
print(f'MSE: {mean_squared_error(y_test, y_pred): 0.3f}')
print(f'RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)): 0.3f}')
R2 score:  0.250
MSE:  698.290
RMSE:  26.425
In [71]:
#plotting actual vs predicted model  
sns.scatterplot(x = y_test, y= y_pred)
plt.xlabel('Actual value')
plt.ylabel('Predicted value')
plt.title('Actual vs Predicted values of DecisionTree Regressor')
Out[71]:
Text(0.5, 1.0, 'Actual vs Predicted values of DecisionTree Regressor')

4.6.2. Decision Tree Regressor Model Performance¶

The performance of the Decision Tree model is not as good as we would have liked, but it is markedly better than the OLS model: the R squared value improves from 0.003 to 0.250. A close look at the scatter plot of actual vs predicted values reveals a straight line going through the origin, with many other points surrounding it. The other error metrics, Mean Squared Error and Root Mean Squared Error, are also quite tolerable.

An RMSE value of about 26.4 means that the average difference between the actual acquisition cost and the cost predicted by the model is around $26. For a grocery chain running on thin margins, this difference is quite large. We need to find ways to improve the quality of predictions further, so we move to a slightly more complex model - XGBoost.
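To make the RMSE interpretation concrete, here is a minimal sketch of how the metric is computed, using hypothetical cost values rather than our dataset:

```python
import numpy as np

# hypothetical actual and predicted acquisition costs (in dollars)
actual = np.array([100.0, 120.0, 90.0, 110.0])
predicted = np.array([110.0, 115.0, 100.0, 104.0])

# RMSE: square the errors, average them, then take the square root,
# so the result is in the same units (dollars) as the cost itself
rmse = np.sqrt(np.mean((actual - predicted) ** 2))
print(round(rmse, 2))  # → 8.08
```

Because the errors are squared before averaging, RMSE penalizes large individual misses more heavily than the plain mean absolute error would.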

In [72]:
#XGBoost Regressor
from xgboost import XGBRegressor
import xgboost as xgb
xgb_model = XGBRegressor()
xgb_model.fit(X_train, y_train)
Out[72]:
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
             gamma=0, gpu_id=-1, importance_type=None,
             interaction_constraints='', learning_rate=0.300000012,
             max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
             monotone_constraints='()', n_estimators=100, n_jobs=16,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
             validate_parameters=1, verbosity=None)
In [73]:
#Plotting 10 most important factors in predicting cost as used by the XGB Regressor model
xgb_fig = xgb.plot_importance(xgb_model, max_num_features= 10)
In [74]:
#creating an array for predictions
y_pred = xgb_model.predict(X_test)
In [75]:
#printing metrics of XGB Regressor model
print(f'R2 score: {r2_score(y_test, y_pred): 0.2f}')
print(f'MSE: {mean_squared_error(y_test, y_pred): 0.2f}')
print(f'RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)): 0.2f}')
R2 score:  0.17
MSE:  772.32
RMSE:  27.79
In [76]:
#plotting actual vs predicted values
sns.scatterplot(x = y_test, y = y_pred)
plt.xlabel('Actual value')
plt.ylabel('Predicted value')
plt.title('Actual vs Predicted values of XGBRegressor')
Out[76]:
Text(0.5, 1.0, 'Actual vs Predicted values of XGBRegressor')

4.6.3. XGBoost Regressor Model Performance¶

Surprisingly, the Decision Tree model performed slightly better than the XGBoost model. The R squared value for the latter is 0.17, and the MSE and RMSE are 772.32 and 27.79 respectively. The inner mathematics of XGBoost is quite complex; however, we can visualize the importance the model gave to each of our features, in case an insight can be drawn from them. On making this plot, we see that the model gave the most importance to the 'Total Number of Children' variable, which aligns with our previous finding that our core customer segment is people with children. This means there may really be a strong relation between the number of children our customers have and the customer acquisition cost.

However, in the end, we would choose the Decision Tree Regressor model if we had to make predictions using this exact dataset.

4.6.4. Conclusion of Cost Prediction using Customer Features¶

Although all models performed differently and gave us different results, none of them were as good as we would have liked them to be. Our best working model is still $26 off in predicting the cost of customer acquisition, which is quite a lot. In order to improve the performance of our models, we can take a few steps, some of which are listed below:

1) We can change the proportion of the training and testing subsets of our data.
2) We can use cross-validation to better test the performance of our model on new data.
3) We can tune the hyper-parameters of the models, since we have only used the default values.
4) Most importantly, we can urge management to focus on expanding the dataset to include customer features more relevant to our use case, such as time spent in the store, aisles visited, and how customers respond to various promotional deals. This will provide better insights into customer behaviour and make predicting customer acquisition costs easier.
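As a hedged sketch of the cross-validation suggestion (on synthetic data standing in for our customer features, not the actual dataset), scikit-learn's `cross_val_score` evaluates a model on several held-out folds instead of a single split:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# synthetic regression data standing in for the customer features
rng = np.random.default_rng(0)
X_demo = rng.random((200, 5))
y_demo = X_demo @ np.array([1.0, 2.0, 0.5, 0.0, 3.0]) + rng.normal(0.0, 0.1, 200)

# 5-fold cross-validation: fit on 4 folds, score R^2 on the held-out fold,
# repeated so every fold is held out once
scores = cross_val_score(DecisionTreeRegressor(random_state=0), X_demo, y_demo,
                         cv=5, scoring='r2')
print(len(scores), round(scores.mean(), 3))
```

Averaging the fold scores gives a more stable performance estimate than the single train/test split we used above, and the spread across folds hints at how sensitive the model is to the particular split.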
